Skip to content

Fix UTF-8 code units to match the number of bytes#4098

Merged
Earlopain merged 1 commit intoruby:mainfrom
Shopify:vs_fix_utf_8_code_units
Apr 30, 2026
Merged

Fix UTF-8 code units to match the number of bytes#4098
Earlopain merged 1 commit intoruby:mainfrom
Shopify:vs_fix_utf_8_code_units

Conversation

@vinistock
Copy link
Copy Markdown
Collaborator

@vinistock vinistock commented Apr 29, 2026

I was seeing some weird results for UTF-8 in the Ruby LSP and I realized we actually made a mistake when implementing the code unit offsets. The idea is for this method to return code units (not code points!). That means:

  • UTF-32: number of characters
  • UTF-16: the weird code unit count with 1/2 byte characters being length 1 and 3/4 byte characters length 2
  • UTF-8: number of bytes

Spec reference: see the comments above each position encoding kind.

We are actually not returning the number of bytes, but the string length, which is incorrect. It's easy to see the mistake in the tests: the location of an emoji (4 bytes) was 1 before (single character).

This PR makes sure that we're returning the number of bytes for UTF-8, which is the amount expected for code units.

Basically, I added a UTF-8 counter that just returns the number of bytes and tried to ensure naming consistency across the counters.

@vinistock vinistock self-assigned this Apr 29, 2026
@vinistock vinistock requested a review from kddnewton April 29, 2026 21:27
@vinistock vinistock force-pushed the vs_fix_utf_8_code_units branch 2 times, most recently from 8dd6d5b to 817ec5a Compare April 29, 2026 21:41
@kddnewton
Copy link
Copy Markdown
Collaborator

To get typecheck to pass you need to run BUNDLE_GEMFILE=gemfiles/typecheck/Gemfile bundle exec rake typecheck:rbs_inline typecheck:rbi

Comment thread lib/prism/parse_result.rb Outdated
@vinistock vinistock force-pushed the vs_fix_utf_8_code_units branch from 817ec5a to 846dc81 Compare April 30, 2026 13:27
@vinistock
Copy link
Copy Markdown
Collaborator Author

Just FYI: running the command generated other RBI/RBS changes that aren't related to this PR, so I think the files might be out of sync in main despite not causing any type checking failures.

I only included the updates for parse result.

@vinistock vinistock requested a review from Earlopain April 30, 2026 13:38
Copy link
Copy Markdown
Collaborator

@Earlopain Earlopain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks good, thanks!

@Earlopain Earlopain merged commit 442bd90 into ruby:main Apr 30, 2026
61 checks passed
@vinistock vinistock deleted the vs_fix_utf_8_code_units branch April 30, 2026 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants